Resampling Methods for Unsupervised Learning from Sample Data
Abstract
Two important tasks of machine learning are statistical learning from sample data (SL) and unsupervised learning from unlabelled data (UL) (Hastie et al., 2001; Theodoridis & Koutroumbas, 2006). The synthesis of the two, unsupervised statistical learning (USL), is frequently used in the cyclic process of inductive and deductive scientific inference. This applies especially to fields of science where promising, testable hypotheses are unlikely to be obtained through manual work, the human senses, or intuition. Instead, huge and complex experimental data sets have to be analyzed with machine learning (USL) methods to generate valuable hypotheses. A typical example is the field of functional genomics (Kell & Oliver, 2004). When machine learning methods are used to generate hypotheses, human intelligence is replaced by artificial intelligence, and the proper functioning of this type of ‘intelligence’ has to be validated. This chapter focuses on the validation of cluster analysis, which is an important element of USL. It is assumed that the data set is a sample from a mixture population, which is statistically modeled as a mixture distribution. Cluster analysis is used to ‘learn’ the number and characteristics of the components of the mixture distribution (Hastie et al., 2001). For this purpose, similar elements of the sample are assigned to groups (clusters). Ideally, a cluster represents all of the elements drawn from one population of the mixture. However, clustering results often contain errors because the algorithms lack robustness. Quite different partitions may result even from samples that differ only slightly. That is, the obtained clusters have a random character. In this case, generalizing from the clusters of a sample to the underlying populations is inappropriate.
If a hypothesis derived from such clustering results is used to design an experiment, the outcome of this experiment is unlikely to lead to a model with high predictive power. Thus, a new study has to be performed to find a better hypothesis. Even a single cycle of hypothesis generation and hypothesis testing can be time-consuming and expensive (e.g., a gene expression study in cancer research with 200 patients lasts more than a year and costs more than 100,000 dollars). Therefore, it is desirable to increase the efficiency and effectiveness of scientific progress by using suitable validation tools. One approach to the statistical validation of clustering results is data resampling (Lunneborg, 2000). It can be seen as a special Monte Carlo method, that is, as a method for...
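The idea of checking whether clusters have a "random character" can be illustrated with a small sketch. It is not the chapter's procedure. It assumes 1-D data, a plain k-means clusterer, bootstrap resampling (drawing with replacement), and the Rand index as the agreement measure. All function names and parameters (kmeans_1d, rand_index, bootstrap_stability, n_resamples) are hypothetical and chosen only for illustration.

```python
import random

def nearest(v, centers):
    """Index of the center closest to value v."""
    return min(range(len(centers)), key=lambda j: abs(v - centers[j]))

def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's k-means on 1-D data; returns the sorted cluster centers."""
    s = sorted(values)
    # Deterministic start (an assumption): centers at evenly spaced quantiles.
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # An empty group keeps its previous center.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return sorted(centers)

def rand_index(a, b):
    """Fraction of element pairs on which two partitions agree (same/different cluster)."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

def bootstrap_stability(values, k, n_resamples=20, seed=0):
    """Mean Rand index between the full-sample partition and partitions
    induced by centers learned on bootstrap resamples."""
    rng = random.Random(seed)
    ref_centers = kmeans_1d(values, k)
    reference = [nearest(v, ref_centers) for v in values]
    scores = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]  # draw with replacement
        centers = kmeans_1d(resample, k)
        # Label the ORIGINAL points with the resample's centers, then compare.
        labels = [nearest(v, centers) for v in values]
        scores.append(rand_index(reference, labels))
    return sum(scores) / len(scores)

# Synthetic sample from a two-component mixture (illustrative data only).
rng = random.Random(42)
sample = ([rng.gauss(0.0, 1.0) for _ in range(60)]
          + [rng.gauss(10.0, 1.0) for _ in range(60)])
stability_k2 = bootstrap_stability(sample, k=2)
stability_k5 = bootstrap_stability(sample, k=5)
print("stability for k=2:", stability_k2)
print("stability for k=5:", stability_k5)
```

A score close to 1 means the partition is reproduced almost exactly across resamples. Noticeably lower scores suggest that the clusters depend on the particular sample, in which case generalizing to the underlying populations would be questionable. In this toy setting one would expect the over-split k=5 to score no higher than k=2, since it has to draw boundaries inside homogeneous components.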
Similar resources
Using Unsupervised Learning to Guide Resampling in Imbalanced Data Sets
The class imbalance problem causes a classifier to overfit the data belonging to the class with the greatest number of training examples. The purpose of this paper is to argue that methods that equalize class membership are not as effective as possible when applied blindly and that improvements can be obtained by adjusting for the within-class imbalance. A guided resampling technique is proposed and...
A Resampling Technique for Relational Data Graphs
Resampling (a.k.a. bootstrapping) is a computationally intensive statistical technique for estimating the sampling distribution of an estimator. Resampling is used in many machine learning algorithms, including ensemble methods, active learning, and feature selection. Resampling techniques generate pseudosamples from an underlying population by sampling with replacement from a single sample data...
Bootstrap and Jackknife Resampling Methods in Survival Analysis of Thalassemia Major Patients
Background and Objectives: A small sample size can influence the results of statistical analysis. The sample size may be reduced for various reasons, such as loss of information, i.e. missing values in some variables. This study aimed to apply bootstrap and jackknife resampling methods in survival analysis of thalassemia major patients. Methods: In this historical coh...
A Comparison of Resampling Methods for Clustering Ensembles
Combination of multiple clusterings is an important task in the area of unsupervised learning. Inspired by the success of supervised bagging algorithms, we propose a resampling scheme for integration of multiple independent clusterings. Individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given data set. In this paper, we compare th...
Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning
Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where it is expensive and time-consuming to obtain adequate labeled data. We propose a novel method named DALRRL, which consists of deep ...